Why is this relevant?
In learning R today, we’ll take the approach used in R for Data Science (Wickham & Grolemund, 2016), diagrammed like this:
One of the strengths of R: every step of this workflow can be done using it!
Download RStudio: https://www.rstudio.com/products/rstudio/download3/#download
install.packages('tidyverse')
A package only needs to be installed once (per major version, e.g. 3.3.x to 3.4.x), but must be loaded every time.
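In practice that looks like this (run the install line once, the library() call at the top of every session):

```r
# once per R installation:
# install.packages('tidyverse')

# every session:
library(tidyverse)
```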
Data on the fuel efficiency of 38 models of cars between 1999 and 2008 from the US EPA:
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl cla…
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <ch>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p com…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p com…
## 3 audi a4 2 2008 4 manu… f 20 31 p com…
## 4 audi a4 2 2008 4 auto… f 21 30 p com…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p com…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p com…
## 7 audi a4 3.1 2008 6 auto… f 18 27 p com…
## 8 audi a4 q… 1.8 1999 4 manu… 4 18 26 p com…
## 9 audi a4 q… 1.8 1999 4 auto… 4 16 25 p com…
## 10 audi a4 q… 2 2008 4 manu… 4 20 28 p com…
## # ... with 224 more rows
Exercises with aes(): try hwy vs cyl, then class vs drv. What’s going on here? Is this useful? How might we make it more useful?
What about those outliers? Use the color aesthetic.
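A sketch using class for the color aesthetic (any categorical variable would do):

```r
library(ggplot2)

# color each point by the car's class; the two-seater outliers stand out
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))
```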
What’s happening here?
What’s happened here? What colour are the points? Why?
Which mpg variables are categorical, and which are continuous? (Hint: ?mpg.) What happens if you map aes(color = displ < 5)?

TODO: This might be a good point to talk about formulas - since we use the formula notation with facets.
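Facets use R’s formula notation; a standard ggplot2 sketch:

```r
library(ggplot2)

# one panel per value of class; ~ class is a one-sided formula
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)
```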
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
?geom_smooth
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = drv)) +
geom_smooth(mapping = aes(x = displ, y = hwy, color = drv, linetype = drv))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(mapping = aes(linetype = drv))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
A common scenario:
## [1] 0.15
## [1] 44.66667
## [1] 1
Calling R functions, in general:
function_name(arg1 = val1, arg2 = val2, ...)
An example function, seq():

seq(1, 10)
## [1]  1  2  3  4  5  6  7  8  9 10
dplyr and tidyr (tidyverse)

We’ll look at CANSIM Tables 0477-0058 and 0477-0059, which cover university revenues and expenditures, respectively.
TODO: Should we change the examples from CANSIM to a built in dataset?
To save time, I’ve already downloaded the data from CANSIM. You can get it here:
We’ll use read_csv() from readr, which I prefer to base R’s read.csv(), as it has more sensible defaults and it’s faster.
# http://www20.statcan.gc.ca/tables-tableaux/cansim/csv/04770058-eng.zip
# make sure the zip file is in the same directory as your RStudio project
revenue_raw <- read_csv('04770058-eng.zip')
## Parsed with column specification:
## cols(
## Ref_Date = col_character(),
## GEO = col_character(),
## SCHOOL = col_character(),
## REVENUE = col_character(),
## FUND = col_character(),
## Vector = col_character(),
## Coordinate = col_character(),
## Value = col_double()
## )
The read_csv() function tells us that it guessed the types of the various columns. In this situation, the default guesses are fine, but we can force it to treat columns in certain ways if we wish.
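For example, we could override a guess explicitly with col_types (a hypothetical override; the defaults are fine here):

```r
library(readr)

# spell out the types instead of letting read_csv() guess
revenue_raw <- read_csv('04770058-eng.zip',
                        col_types = cols(Ref_Date = col_character(),
                                         Value = col_double()))
```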
## # A tibble: 122,240 x 8
## Ref_Date GEO SCHOOL REVENUE FUND Vector Coordinate Value
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 2000/2001 Canada Total univ… Total r… Total f… v8070… 1.1.1.1 1.62e7
## 2 2001/2002 Canada Total univ… Total r… Total f… v8070… 1.1.1.1 1.72e7
## 3 2002/2003 Canada Total univ… Total r… Total f… v8070… 1.1.1.1 1.85e7
## 4 2003/2004 Canada Total univ… Total r… Total f… v8070… 1.1.1.1 2.16e7
## 5 2004/2005 Canada Total univ… Total r… Total f… v8070… 1.1.1.1 2.27e7
## 6 2005/2006 Canada Total univ… Total r… Total f… v8070… 1.1.1.1 2.47e7
## 7 2006/2007 Canada Total univ… Total r… Total f… v8070… 1.1.1.1 2.65e7
## 8 2007/2008 Canada Total univ… Total r… Total f… v8070… 1.1.1.1 2.72e7
## 9 2008/2009 Canada Total univ… Total r… Total f… v8070… 1.1.1.1 2.64e7
## 10 2009/2010 Canada Total univ… Total r… Total f… v8070… 1.1.1.1 3.19e7
## # ... with 122,230 more rows
We have 8 columns and 122,240 rows of data. read_csv() brings the data in as a tibble, which is just an R “data frame”, but with some handy defaults, some of which we’re seeing here. For instance, it gives us the size of the data frame in rows and columns, the types of the columns (e.g., “<chr>” for character) and only prints the first 10 rows, instead of overwhelming us with all of the data.
Or click the icon in the Environment tab.
(See Wickham, 2014 and R for Data Science, chapter 12)
Is the CANSIM data tidy?
The five key dplyr verbs:
Pick observations by their values (filter()).
Reorder the rows (arrange()).
Pick variables by their names (select()).
Create new variables with functions of existing variables (mutate()).
Collapse many values down to a single summary (summarise()).
All can be used in conjunction with group_by()
https://www.rstudio.com/resources/cheatsheets/
(or Help | Cheatsheets in RStudio!)
filter() the rows we want:

filter(revenue_raw,
       SCHOOL == 'Total universities and colleges',
       FUND == 'Total funds (x 1,000)')
## # A tibble: 5,632 x 8
## Ref_Date GEO SCHOOL REVENUE FUND Vector Coordinate Value
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 2000/2001 Canada Total univ… Total r… Total f… v8070… 1.1.1.1 1.62e7
## 2 2001/2002 Canada Total univ… Total r… Total f… v8070… 1.1.1.1 1.72e7
## 3 2002/2003 Canada Total univ… Total r… Total f… v8070… 1.1.1.1 1.85e7
## 4 2003/2004 Canada Total univ… Total r… Total f… v8070… 1.1.1.1 2.16e7
## 5 2004/2005 Canada Total univ… Total r… Total f… v8070… 1.1.1.1 2.27e7
## 6 2005/2006 Canada Total univ… Total r… Total f… v8070… 1.1.1.1 2.47e7
## 7 2006/2007 Canada Total univ… Total r… Total f… v8070… 1.1.1.1 2.65e7
## 8 2007/2008 Canada Total univ… Total r… Total f… v8070… 1.1.1.1 2.72e7
## 9 2008/2009 Canada Total univ… Total r… Total f… v8070… 1.1.1.1 2.64e7
## 10 2009/2010 Canada Total univ… Total r… Total f… v8070… 1.1.1.1 3.19e7
## # ... with 5,622 more rows
The filter() help tells us that everything after the first argument (i.e., the “…”) consists of: “Logical predicates defined in terms of the variables in .data. Multiple conditions are combined with &.”
Other operators we can use with filter():
Complete set of boolean operations. x is the left-hand circle, y is the right-hand circle, and the shaded regions show which parts each operator selects.
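For example, combining conditions with these operators (the GEO values are real values from the data; %in% is often cleaner than a chain of |):

```r
library(dplyr)

# either province, written with | ...
filter(revenue_raw, GEO == 'Alberta' | GEO == 'Ontario')

# ... or the same thing with %in%, plus negation with !
filter(revenue_raw, GEO %in% c('Alberta', 'Ontario'), !is.na(Value))
```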
In other words, the previous command is equivalent to this:

filter(revenue_raw,
       SCHOOL == 'Total universities and colleges' &
       FUND == 'Total funds (x 1,000)')
## # A tibble: 5,632 x 8
## Ref_Date GEO SCHOOL REVENUE FUND Vector Coordinate Value
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 2000/2001 Canada Total univ… Total r… Total f… v8070… 1.1.1.1 1.62e7
## 2 2001/2002 Canada Total univ… Total r… Total f… v8070… 1.1.1.1 1.72e7
## 3 2002/2003 Canada Total univ… Total r… Total f… v8070… 1.1.1.1 1.85e7
## 4 2003/2004 Canada Total univ… Total r… Total f… v8070… 1.1.1.1 2.16e7
## 5 2004/2005 Canada Total univ… Total r… Total f… v8070… 1.1.1.1 2.27e7
## 6 2005/2006 Canada Total univ… Total r… Total f… v8070… 1.1.1.1 2.47e7
## 7 2006/2007 Canada Total univ… Total r… Total f… v8070… 1.1.1.1 2.65e7
## 8 2007/2008 Canada Total univ… Total r… Total f… v8070… 1.1.1.1 2.72e7
## 9 2008/2009 Canada Total univ… Total r… Total f… v8070… 1.1.1.1 2.64e7
## 10 2009/2010 Canada Total univ… Total r… Total f… v8070… 1.1.1.1 3.19e7
## # ... with 5,622 more rows
I prefer this notation, as it’s more explicit.
One more thing: we need to assign the return value of the filter() function back to a variable:
revenue <- filter(revenue_raw,
SCHOOL == 'Total universities and colleges' &
FUND == 'Total funds (x 1,000)')

Shortcut for the assignment operator (<-): Alt-Hyphen in RStudio.
A new variable for each step gets cumbersome, so dplyr provides an operator, the pipe (%>%) that combines operations:
revenue_long <- revenue_raw %>%
# only rows matching this
filter(SCHOOL == 'Total universities and colleges' &
FUND == 'Total funds (x 1,000)') %>%
# remove these columns
select(-SCHOOL, -FUND, -Vector, -Coordinate) %>%
# fix up the date column
mutate(Ref_Date = as.integer(stringr::str_sub(Ref_Date, 1, 4)))

x %>% f(y) turns into f(x, y), and x %>% f(y) %>% g(z) turns into g(f(x, y), z), etc.
Shortcut for pipe operator: Ctrl-Shift-M in RStudio
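The rewriting rule in miniature (using %>% from magrittr, which the tidyverse attaches):

```r
library(magrittr)

# x %>% f() %>% g() is g(f(x)):
c(1, 4, 9) %>% sqrt() %>% sum()
## [1] 6

# identical to the nested call:
sum(sqrt(c(1, 4, 9)))
## [1] 6
```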
## # A tibble: 1 x 4
## Ref_Date GEO REVENUE Value
## <int> <chr> <chr> <dbl>
## 1 2000 Canada Total revenues 16224715
Helpers you can use inside select():
starts_with("abc"): matches names that begin with “abc”.
ends_with("xyz"): matches names that end with “xyz”.
contains("ijk"): matches names that contain “ijk”.
matches("(.)\\1"): selects variables that match a regular expression.
num_range("x", 1:3): matches x1, x2 and x3.
one_of(vector): columns whose names are in said vector.

tidyr

The two main tidyr functions:
gather(data, key, value): moves column names into a key column, with the values going into a single value column.
spread(data, key, value): moves unique values of the key column into column names, with values from the value column.

tidyr cheatsheet: https://www.rstudio.com/resources/cheatsheets/ (Data Import Cheatsheet)
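A toy example (made-up data) showing the two operations are inverses of each other:

```r
library(tidyverse)

wide <- tibble(country = c('A', 'B'),
               `2000` = c(10, 30),
               `2001` = c(20, 40))

# wide -> long: column names become a 'year' key, cells a 'revenue' value
long <- gather(wide, key = 'year', value = 'revenue', `2000`, `2001`)

# long -> wide: undoes the gather
spread(long, key = year, value = revenue)
```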
spread() the CANSIM data:

revenue <- revenue_long %>%
  spread(key = REVENUE, value = Value)
revenue
## # A tibble: 176 x 34
## Ref_Date GEO `Canada Foundat… `Canada Researc… `Canadian Insti…
## <int> <chr> <dbl> <dbl> <dbl>
## 1 2000 Albe… 8208 1750 42174
## 2 2000 Brit… 5189 1525 26960
## 3 2000 Cana… 212418 19802 329025
## 4 2000 Mani… 7170 550 10780
## 5 2000 New … 1745 300 96
## 6 2000 Newf… 2799 0 2771
## 7 2000 Nova… 3234 0 7521
## 8 2000 Onta… 91066 8533 117466
## 9 2000 Prin… 355 0 145
## 10 2000 Queb… 75988 7144 117205
## # ... with 166 more rows, and 29 more variables: `Credit courses
## # tuiton` <dbl>, `Donations made by business enterprises` <dbl>,
## # `Donations made by individuals` <dbl>, `Donations made by
## # not-for-profit organizations` <dbl>, Endowment <dbl>, Federal <dbl>,
## # Foreign <dbl>, `Grants made by business enterprises` <dbl>, `Grants
## # made by Individuals` <dbl>, `Grants made by not-for-profit
## # organizations` <dbl>, `Health Canada` <dbl>, Investments <dbl>,
## # Miscellaneous <dbl>, Municipal <dbl>, `Natural Sciences and
## # Engineering Research Council` <dbl>, `Non-credit tution` <dbl>,
## # `Non-federal` <dbl>, `Other federal` <dbl>, `Other fees` <dbl>, `Other
## # investments` <dbl>, `Other provinces` <dbl>, `Other revenue
## # type` <dbl>, Provincial <dbl>, `Sale of service and products` <dbl>,
## # `Social Sciences and Humanities Research Council` <dbl>, `Total
## # donations` <dbl>, `Total grants` <dbl>, `Total revenues` <dbl>,
## # `Tuition and other fees` <dbl>
revenue %>%
filter(GEO == 'Canada') %>%
# pass the result to ggplot() as the first argument
ggplot(aes(Ref_Date, `Total revenues`)) +
# now it switches to + to combine, which is ggplot's way
geom_line()

Not the prettiest, but it works!
revenue %>%
filter(GEO == 'Canada') %>%
ggplot(aes(Ref_Date, `Total revenues`)) +
geom_line() +
labs(title = 'Total University and Degree-Granting College Revenue',
x = 'Fiscal Year',
y = 'Revenue ($ thousands)') +
scale_y_continuous(labels = scales::comma)

revenue %>%
filter(GEO %in% c('Canada', 'Alberta', 'Ontario')) %>%
ggplot(aes(Ref_Date, `Tuition and other fees` / `Total revenues`, color = GEO)) +
geom_line() +
labs(title = 'Tuition and fees as a share of total revenue') +
scale_y_continuous(labels = scales::percent)

Combining the dplyr verbs: which province took in the most tuition revenue in 2010?
revenue %>%
filter(GEO != 'Canada', Ref_Date == 2010) %>%
select(GEO, `Tuition and other fees`) %>%
arrange(desc(`Tuition and other fees`))
## # A tibble: 10 x 2
## GEO `Tuition and other fees`
## <chr> <dbl>
## 1 Ontario 3563011
## 2 British Columbia 1009476
## 3 Quebec 825943
## 4 Alberta 711973
## 5 Nova Scotia 305597
## 6 Manitoba 176165
## 7 Saskatchewan 173600
## 8 New Brunswick 146213
## 9 Newfoundland and Labrador 61537
## 10 Prince Edward Island 27878
revenue %>%
select(Ref_Date, GEO, `Tuition and other fees`, `Total revenues`) %>%
mutate(tuition_share = `Tuition and other fees` / `Total revenues`)
## # A tibble: 176 x 5
## Ref_Date GEO `Tuition and othe… `Total revenues` tuition_share
## <int> <chr> <dbl> <dbl> <dbl>
## 1 2000 Alberta 298464 1716783 0.174
## 2 2000 British Col… 321017 2020474 0.159
## 3 2000 Canada 3052960 16224715 0.188
## 4 2000 Manitoba 99932 629163 0.159
## 5 2000 New Brunswi… 79153 356899 0.222
## 6 2000 Newfoundlan… 49954 268822 0.186
## 7 2000 Nova Scotia 173980 685133 0.254
## 8 2000 Ontario 1509442 6192454 0.244
## 9 2000 Prince Edwa… 13778 63238 0.218
## 10 2000 Quebec 407458 3613399 0.113
## # ... with 166 more rows
Keeps the same number of rows
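Any vectorized function works inside mutate(). A quick base-R illustration of a few of them on a bare vector:

```r
x <- c(5, 1, 3, 2)

cumsum(x)                     # running total: 5 6 9 11
x %% 2                        # remainder: 1 1 1 0
x %/% 2                       # integer division: 2 0 1 1
ifelse(x > 2, 'high', 'low')  # "high" "low" "high" "low"
```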
Useful functions for mutate():
Arithmetic operators: +, -, *, /, ^
Modular arithmetic: %/% (integer division), %% (remainder)
Logs: log(), log2(), log10()
Offsets: lead(), lag()
Cumulative aggregates: cumsum(), cumprod(), cummin(), cummax(), cummean()
Logical comparisons: <, <=, >, >=, !=, ==
Ranking: min_rank(), row_number(), dense_rank(), percent_rank(), cume_dist(), ntile()
Conditionals: ifelse(), recode(), case_when()

Grouped summaries with summarise():

revenue %>%
  group_by(GEO) %>%
  summarise(avg_endowment_revenue = mean(Endowment)) %>%
  arrange(-avg_endowment_revenue)
## # A tibble: 11 x 2
## GEO avg_endowment_revenue
## <chr> <dbl>
## 1 Canada 526071.
## 2 Ontario 211475.
## 3 Quebec 87323.
## 4 Alberta 80069
## 5 British Columbia 75752.
## 6 Nova Scotia 31148.
## 7 Saskatchewan 21256.
## 8 New Brunswick 13778.
## 9 Newfoundland and Labrador 2878.
## 10 Manitoba 1646.
## 11 Prince Edward Island 747.
Useful functions for summarise():
Centre: mean(x), median(x)
Spread: sd(x), IQR(x) (interquartile range), mad(x) (median absolute deviation)
Range: min(x), max(x), quantile(x, 0.25)
Position: first(x), nth(x, 2), last(x)
Counts: n(), n_distinct(x)
Counts and proportions of logical values: sum(x > 10), mean(y == 0) - TRUE is converted to 1 and FALSE to 0.

Exercise: compute the total tuition revenue of the three western-most provinces in 2007.
revenue %>%
filter(GEO %in% c('British Columbia', 'Alberta', 'Saskatchewan'), Ref_Date == 2007) %>%
summarise(sum(`Tuition and other fees`))
## # A tibble: 1 x 1
## `sum(\`Tuition and other fees\`)`
## <dbl>
## 1 1374509
These three provinces compared to all other provinces and the national total:
revenue %>%
filter(Ref_Date == 2007) %>%
mutate(category = case_when(
GEO %in% c('British Columbia', 'Alberta', 'Saskatchewan') ~ '3 Western Provinces',
GEO == 'Canada' ~ GEO,
TRUE ~ 'All Other Provinces'
)) %>%
group_by(category) %>%
summarise(sum(`Tuition and other fees`))
## # A tibble: 3 x 2
## category `sum(\`Tuition and other fees\`)`
## <chr> <dbl>
## 1 3 Western Provinces 1374509
## 2 All Other Provinces 4079866
## 3 Canada 5454375
tuition_relative_change <- revenue %>%
arrange(Ref_Date, GEO) %>%
group_by(GEO) %>%
mutate(rel_change = (`Tuition and other fees` - first(`Tuition and other fees`)) / first(`Tuition and other fees`)) %>%
select(Ref_Date, GEO, `Tuition and other fees`, rel_change)
ggplot(tuition_relative_change, aes(Ref_Date, rel_change, color = GEO)) +
geom_line()

Count years with positive endowment income by province:

revenue %>%
  group_by(GEO) %>%
  summarise(num_pos_endow = sum(Endowment > 0))
## # A tibble: 11 x 2
## GEO num_pos_endow
## <chr> <int>
## 1 Alberta 13
## 2 British Columbia 13
## 3 Canada 13
## 4 Manitoba 14
## 5 New Brunswick 13
## 6 Newfoundland and Labrador 13
## 7 Nova Scotia 14
## 8 Ontario 12
## 9 Prince Edward Island 13
## 10 Quebec 16
## 11 Saskatchewan 12
purrr (quickly)

TODO: I think we remove the purrr section - as per discussion.
revenue %>%
filter(GEO != 'Canada') %>%
group_by(Ref_Date) %>%
summarise(revenue_quantiles = list(quantile(`Total revenues`, c(0.25, 0.5, 0.75)))) %>%
mutate(
low_25 = map_dbl(revenue_quantiles, "25%"),
mid = map_dbl(revenue_quantiles, '50%'),
high_75 = map_dbl(revenue_quantiles, '75%')
)
## # A tibble: 16 x 5
## Ref_Date revenue_quantiles low_25 mid high_75
## <int> <list> <dbl> <dbl> <dbl>
## 1 2000 <dbl [3]> 424965 681742. 1944551.
## 2 2001 <dbl [3]> 417664. 700424 2126550.
## 3 2002 <dbl [3]> 465112. 729434 2378510.
## 4 2003 <dbl [3]> 480404. 804636. 2708008.
## 5 2004 <dbl [3]> 530995. 853132. 2818174.
## 6 2005 <dbl [3]> 560023. 911819 3257599.
## 7 2006 <dbl [3]> 616412. 964536. 3320636.
## 8 2007 <dbl [3]> 592927. 944042. 3560364.
## 9 2008 <dbl [3]> 568461. 968932. 3404852.
## 10 2009 <dbl [3]> 741016. 1161066 4363439.
## 11 2010 <dbl [3]> 787082. 1213456 4436956.
## 12 2011 <dbl [3]> 775059. 1169906. 4177081
## 13 2012 <dbl [3]> 799797. 1240542 4296184.
## 14 2013 <dbl [3]> 791843. 1298845 4437751.
## 15 2014 <dbl [3]> 799297. 1316785 4623504.
## 16 2015 <dbl [3]> 770571. 1261936. 4451983.
Missing values are “contagious”: almost any operation involving an unknown value will also be unknown.

NA > 5
## [1] NA
10 == NA
## [1] NA
NA + 10
## [1] NA
What about this?
NA / 2
## [1] NA

Use is.na() to test whether a value is missing:

is.na(NA)
## [1] TRUE
filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values. If you want to preserve missing values, ask for them explicitly:

df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)
## # A tibble: 1 x 1
##       x
##   <dbl>
## 1     3
filter(df, is.na(x) | x > 1)
## # A tibble: 2 x 1
##       x
##   <dbl>
## 1    NA
## 2     3
## [1] 1 2 3
## [1] 1 2 3
1:3 + 1:9
## [1] 2 4 6 5 7 9 8 10 12
What about…
1:3 + 1:10
## Warning in 1:3 + 1:10: longer object length is not a multiple of shorter
## object length
## [1] 2 4 6 5 7 9 8 10 12 11
What happened?
## [1] 2
What do we expect here?

mean(c(1, NA, 3))
## [1] NA

Most summary functions take an na.rm argument to drop missing values first:

mean(c(1, NA, 3), na.rm = TRUE)
## [1] 2
## x y z
## 1 2 4
## a b c
## 1 2 3
## [1] "three" "two" "five"
## [1] "two" "four"
# subset a named vector w/character vector
x <- c(abc = 1, def = 2, xyz = 5)
x[c("xyz", "def")]
## xyz def
## 5 2
## [1] 10 3 5 8 1
## [1] 10 NA 8 NA
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3
## List of 3
## $ : num 1
## $ : num 2
## $ : num 3
Lists can contain a mix of types:

str(list("a", 1L, 1.5, TRUE))
## List of 4
## $ : chr "a"
## $ : int 1
## $ : num 1.5
## $ : logi TRUE
Lists can even contain other lists:

str(list(list(1, 2), list(3, 4)))
## List of 2
## $ :List of 2
## ..$ : num 1
## ..$ : num 2
## $ :List of 2
## ..$ : num 3
## ..$ : num 4
You are trying to get the model from the family that best fits the data. That doesn’t mean it’s a “good” or “true” model.
The modelr package has a simulated dataset called sim1.
## [1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 7 7 7 8 8
## [24] 8 9 9 9 10 10 10
Plotting sim1, a linear model looks reasonable. geom_abline() plots a straight line on the graph, given an intercept and a slope. That’s just one model - how can we tell if it’s good or not?
We look at the difference between the predictions and the actual data.
We can find this for any model by creating an R function. We take the intercept and add the slope multiplied by x.
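The prediction function itself isn’t shown in these notes; here is a reconstruction consistent with the measure_distance() calls that follow (the data-then-parameters argument order is inferred from those calls):

```r
# reconstruction: mod[1] is the intercept, mod[2] the slope
model1 <- function(data, mod) {
  mod[1] + data$x * mod[2]
}

# with modelr's sim1: model1(sim1, c(12, 0.5)) gives the predictions shown below
model1(data.frame(x = 1:3), c(12, 0.5))
## [1] 12.5 13.0 13.5
```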
## [1] 12.5 12.5 12.5 13.0 13.0 13.0 13.5 13.5 13.5 14.0 14.0 14.0 14.5 14.5
## [15] 14.5 15.0 15.0 15.0 15.5 15.5 15.5 16.0 16.0 16.0 16.5 16.5 16.5 17.0
## [29] 17.0 17.0
Once we have the model’s prediction, we can look at the difference between those and what the actual data is.
measure_distance <- function(data, mod) {
diff <- data$y - model1(data, mod)
diff
}
measure_distance(sim1, c(12, 0.5))
## [1]  -8.3000870  -4.9893659 -10.3745272  -4.0111426  -2.7568946
## [6] -1.7031768 -6.1436353 -2.9946506 -2.9883992 -1.5654109
## [11] -2.1073988 0.2579641 4.6300498 -2.7619788 1.5248539
## [16] -1.7260230 0.9559750 1.8947962 4.5859927 1.6718503
## [21] 4.4363088 5.7259025 2.3909129 6.4755526 10.2770099
## [26] 6.3051098 4.6283053 7.9680994 6.3464221 4.9752007
We want to both make all distances positive and more heavily penalize predictions that are far from the actual values, so we square the difference.
measure_distance <- function(data, mod) {
diff <- data$y - model1(data, mod)
diff ^ 2
}
measure_distance(sim1, c(12, 0.5))
## [1]  68.89144476  24.89377199 107.63081509  16.08926474   7.60046760
## [6] 2.90081117 37.74425497 8.96793224 8.93052986 2.45051128
## [11] 4.44112956 0.06654547 21.43736106 7.62852691 2.32517942
## [16] 2.97915534 0.91388814 3.59025256 21.03132891 2.79508358
## [21] 19.68083613 32.78595957 5.71646454 41.93278203 105.61693163
## [26] 39.75440948 21.42120989 63.49060769 40.27707338 24.75262198
We want to summarize it - so we take the mean across all the predictions and take the square root to put the differences back in terms of the original units.
measure_distance <- function(data, mod) {
diff <- data$y - model1(data, mod)
sqrt(mean(diff ^ 2))
}
measure_distance(sim1, c(12, 0.5))
## [1] 4.995789
So, to choose between models – we can just put in different values.
## [1] 2.665212
So, we can generate a lot of models and pick the one with the lowest error.
models <- tibble(
a1 = runif(250, -20, 40),
a2 = runif(250, -5, 5)
)
sim1_dist <- function(a1, a2) {
measure_distance(sim1, c(a1, a2))
}
models <- models %>%
  mutate(dist = purrr::map2_dbl(a1, a2, sim1_dist))
models
## # A tibble: 250 x 3
## a1 a2 dist
## <dbl> <dbl> <dbl>
## 1 11.1 -0.485 10.4
## 2 -14.9 -0.187 32.2
## 3 27.1 1.70 21.1
## 4 -15.9 -3.23 51.5
## 5 5.41 0.851 6.76
## 6 -17.7 3.70 13.8
## 7 15.1 -2.09 17.0
## 8 5.10 -4.94 42.6
## 9 11.4 -3.74 29.8
## 10 -9.69 -3.77 48.9
## # ... with 240 more rows
ggplot(sim1, aes(x, y)) +
  geom_abline(aes(intercept = a1, slope = a2), data = models, alpha = 1/4) +
  geom_point()

The 10 best models (smallest dist):

## # A tibble: 10 x 3
## a1 a2 dist
## <dbl> <dbl> <dbl>
## 1 1.39 2.31 2.66
## 2 3.57 1.88 2.72
## 3 7.42 1.34 3.05
## 4 -0.0249 2.50 3.06
## 5 8.54 1.62 3.14
## 6 11.0 1.33 4.06
## 7 -2.00 2.53 4.40
## 8 -3.00 2.63 4.87
## 9 -3.52 3.56 4.87
## 10 -5.34 3.26 5.00
Plot those models on the data. Lighter colored models are closer.
ggplot(sim1, aes(x, y)) +
geom_point(size = 2, color = "grey30") +
geom_abline(
aes(intercept = a1, slope = a2, color = -dist),
data = filter(models, rank(dist) <= 10)
)

Now, we switch to parameter space… We plot every candidate model, showing the distance in color, and highlight our 10 best models in red.
ggplot(models, aes(a1, a2)) +
geom_point(data = filter(models, rank(dist) <= 10), size = 4, colour = "red") +
geom_point(aes(colour = -dist))

So, you can zoom in on the area where the best models tend to be. Note the axes and the legend.
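A grid search is one way to generate those candidates. The construction of grid isn’t shown above; a sketch consistent with the parameter values that appear below (the exact ranges are assumptions), reusing sim1_dist() from earlier:

```r
library(tidyverse)

# evenly spaced grid of candidate (a1, a2) pairs, scored with sim1_dist()
grid <- crossing(
  a1 = seq(-5, 20, length.out = 25),
  a2 = seq(1, 3, length.out = 25)
) %>%
  mutate(dist = purrr::map2_dbl(a1, a2, sim1_dist))
```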
What do we get as the new best 10 models?
## a1 a2 dist
## 1 4.375000 2.000000 2.137234
## 2 4.375000 2.083333 2.155409
## 3 3.333333 2.166667 2.168677
## 4 5.416667 1.916667 2.210294
## 5 3.333333 2.250000 2.212637
## 6 5.416667 1.833333 2.218550
## 7 4.375000 1.916667 2.241533
## 8 3.333333 2.083333 2.246169
## 9 4.375000 2.166667 2.293149
## 10 2.291667 2.333333 2.308274
## a1 a2 dist
## 1 4.183673 2.061224 2.128424
## 2 4.693878 1.979592 2.139589
## 3 4.183673 2.020408 2.140222
## 4 3.673469 2.142857 2.144759
## 5 4.183673 2.102041 2.146651
## 6 3.673469 2.102041 2.150084
## 7 4.693878 2.020408 2.151341
## 8 4.693878 1.938776 2.157704
## 9 3.673469 2.183673 2.169193
## 10 5.204082 1.897959 2.177829
grid %>%
ggplot(aes(a1, a2)) +
geom_point(data = filter(grid, rank(dist) <= 10), size = 4, colour = "red") +
geom_point(aes(colour = -dist))

grid %>%
ggplot(aes(a1, a2)) +
geom_point(data = filter(grid, rank(dist) <= 10), size = 4, colour = "red") +
geom_point(aes(colour = -dist)) +
geom_contour(aes(z = -dist), colour = "green")

fine_grid %>%
ggplot(aes(a1, a2)) +
geom_point(data = filter(fine_grid, rank(dist) <= 10), size = 3, colour = "red") +
geom_point(aes(colour = -dist)) +
geom_contour(aes(z = -dist), colour = "orange", bins = 15, size = 1.5)

ggplot(sim1, aes(x, y)) +
geom_point(size = 2, colour = "grey30") +
geom_abline(
aes(intercept = a1, slope = a2, colour = -dist),
data = filter(grid, rank(dist) <= 10)
)

There is a tool to optimize the parameters so you get the smallest distance: optim(). A call consistent with the measure_distance() defined above:

best <- optim(c(0, 0), function(mod) measure_distance(sim1, mod))
best$par
## [1] 4.222248 2.051204
ggplot(sim1, aes(x, y)) +
geom_point(size = 2, colour = "grey30") +
geom_abline(intercept = best$par[1], slope = best$par[2])

Of course, R has a tool to do this whole process automatically: lm(), our standard OLS regression.

sim1_mod <- lm(y ~ x, data = sim1)
coef(sim1_mod)
## (Intercept)           x
##    4.220822    2.051533
We can use a model object like sim1_mod to generate predictions. We can start with an empty grid of x values:

grid <- sim1 %>% modelr::data_grid(x)
grid
## # A tibble: 10 x 1
## x
## <int>
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 6
## 7 7
## 8 8
## 9 9
## 10 10
Add the model’s predictions with modelr::add_predictions():

grid <- grid %>% modelr::add_predictions(sim1_mod)
grid
## # A tibble: 10 x 2
## x pred
## <int> <dbl>
## 1 1 6.27
## 2 2 8.32
## 3 3 10.4
## 4 4 12.4
## 5 5 14.5
## 6 6 16.5
## 7 7 18.6
## 8 8 20.6
## 9 9 22.7
## 10 10 24.7
Plot the predictions with geom_line():

ggplot(sim1, aes(x)) +
  geom_point(aes(y = y)) +
  geom_line(aes(y = pred), data = grid, colour = "red", size = 1)

Add residuals with modelr::add_residuals(). We need to use the original dataset:

sim1 <- sim1 %>% modelr::add_residuals(sim1_mod)
sim1
## # A tibble: 30 x 3
## x y resid
## <int> <dbl> <dbl>
## 1 1 4.20 -2.07
## 2 1 7.51 1.24
## 3 1 2.13 -4.15
## 4 2 8.99 0.665
## 5 2 10.2 1.92
## 6 2 11.3 2.97
## 7 3 7.36 -3.02
## 8 3 10.5 0.130
## 9 3 10.5 0.136
## 10 4 12.4 0.00763
## # ... with 20 more rows
Things to try:
TODO: Add more content here.
~ plays the role of the = we would write in an equation: y ~ x translates to y = a_1 + a_2 * x.
The model_matrix() function shows the data matrix R builds from a formula.
t.test() also uses the formula language - come up with examples.
lm() also uses the formula language.

Start here:
Additional:
For help/community: